ABSTRACT



Analysis of secondary data was used as a way to inform the 
researcher about the trends in her assessment practices over a 4-year period. 
This was an important initial step in an effort to develop and integrate 
high-quality classroom assessment tasks and make sense of assessment 
information for decision making. Scores from 26 groups of graduate and 
undergraduate education students from 3 universities in the United States 
were analyzed. Course goals, objectives, and syllabuses were analyzed. 
Students' backgrounds and group combinations (age, gender, socioeconomic 
status) were taken into consideration in determining the consistency of 
specific assessment tasks in providing feedback to the instructor as 
researcher. The study results provided evidence of high content validity as 
well as high construct validity of timed and nontimed assessment tasks. 
Concurrent validity among similar assessment tasks was evident. However, the 
predictive validity of assessment tasks (individual task to the final score) 
varied depending on whether the assessment task was nontimed (r=0.20 to 
r=0.41) or timed. Timed assessment tasks were high predictors of the 
student's performance in the course (r=0.57 to r=0.91). Timed assessment 
tasks were more reliable (consistent) than nontimed tasks in providing 
assessment feedback across similar groups and contexts (contextual 
reliability). Scores from the nontimed assessment tasks fluctuated more from 
group to group (r=0.04 to r=0.68) than scores from timed tasks. Nontimed 
tasks probably tapped student skills and strategies that were not retrievable 
through timed examinations. The study highlights the importance of 
understanding, documenting, and evaluating assessment practices to better 
inform decision making at both classroom and program levels. (Contains 2 
figures, 3 tables, and 17 references.) (SLD) 

Evaluating Validity and Reliability of Classroom Assessments 

Introduction 

In the past decade the debate on classroom assessment has focused more on 
performance assessments and less on paper and pencil tests. This trend arose 
from concerns about the inadequacy of a single test or a battery of tests to provide 
a comprehensive picture of students' abilities in a subject area, and more 
importantly, lack of transfer of the theoretical knowledge to solving actual real life 
problems despite high scores on tests. By giving students opportunities to actually 
"perform" a task, the teacher can assess the students' skills in more realistic ways. 
Measurement and evaluation books now incorporate information on performance 
assessments (see for example, Kubiszyn & Borich, 1996; Thorndike, 1997; 
Gallagher, 1993; Oosterhof, 1994; Ward & Murray-Ward, 1999). 

As classroom assessment becomes more and more comprehensive, determining 
evidence of validity (accuracy) and reliability (consistency) is also becoming more 
complex. Validity and reliability coefficients are fairly straightforward to compute 
using paper and pencil test scores. Determining validity and reliability of varied 
group projects is not easy. Since classroom assessment is supposed to inform 
decision-making, validity and reliability of the various assessments must be 
ensured. This study is an analysis of various class assessment tasks and the 
resulting numeric scores obtained by students on formative and summative course 
assessments in 26 classes (groups) taught by the researcher in three different 
universities over a four-year period in the United States. 

Conceptual Framework 

In this study, classroom assessment is defined as "selective collection of 
representative information on students' learning, accurate processing and accurate 
interpretation of the information for informed decision making." This definition was 
developed from reviewing several definitions of assessment from different textbooks 
and articles and realizing that none of those definitions covered exactly what the 
researcher wanted the term "classroom assessment" to mean in this study. This 
definition implies that the information collected and analyzed has to provide a clear 
indication as to whether the students have achieved effective learning of the 
targeted skills and concepts, and whether the students themselves realize the 
learning from their own perspectives and ways of interpretation. In other words, the 
assessment data have to show if the students have experienced the shift from a 
state of not knowing or not being able to perform a skill, to a state of knowing or 
being able to perform that skill. 

Why do we assess? Teachers gather information about students' learning so that 
they can get a comprehensive, representative picture of the students' learning 
processes and the skills already acquired. This information is then used to make 
decisions such as grade retention or promotion, remedial work, further training, job 
placements, etc. The type of decision to be made will determine the type of 
information to be collected and how it is analyzed (Airasian, 1996; Gallagher, 1998; 
Stiggins, 1997; Kubiszyn and Borich, 2000; Eby and Martin, 2001; Weber, 1999). 

Comprehensive Assessment: Classroom assessments are becoming more and 
more comprehensive. Some psychological testing books are also incorporating 
thought questions (see for example Kaplan and Saccuzzo, 1997; Heitzman, 1997). 

In the four years during which the data were collected, the researcher needed to 
make decisions about students' mastery of content, application of skills in real 
classroom situations with children, communication skills, ability to reflect on their 
own practices and learn from them, creativity, originality, etc. Since paper and 
pencil tests could do little to achieve all these goals, the researcher used a variety of 
assessments. The assessments included, but were not limited to, take-home 
assignments, group and individual projects, class presentations, observing real 
classrooms and writing case study reports, reading and summarizing research 
reports, micro teaching, course portfolios, as well as quizzes, tests and 
examinations. The reason for such diversification of assessments was to collect 
information from as many perspectives as possible, to tap the different learning 
styles and preferences of students. 

Assessments are authentic if they are relevant, meaningful, and interwoven into the 
curriculum (Puckett and Black, 2000; Puckett and Black, 1994; Eby and Martin, 
2001; Mindes, Ireton & Mardell-Czudnowski, 1996; Bodrova & Leong, 1996; 
Danielson, 1996). The information collected about every student needs to be very 
comprehensive, hence as representative as possible of the capabilities of the 
student being assessed. 

To make the final decision about each student's learning in each course, the 
researcher added together points from the different assignments, determined the 
percentage of the points obtained to the total points possible, and assigned the 
letter grade according to the respective university policies. The grades would then 
inform other decisions at the individual or program level. Figure 1 demonstrates that 
component scores determine the overall score, which in turn determines the course 
grade. 
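
The pooling procedure described above (sum the points, convert to a percentage, map to a letter) can be sketched in a few lines of Python. This is only an illustration: the paper says letter grades followed each university's policy but does not give the cutoffs, so the `ASSUMED_CUTOFFS` table, the function name `course_grade`, and the sample scores are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical cutoffs (minimum percentage, letter). The paper does not report
# the actual university grading policies, so these values are assumptions.
ASSUMED_CUTOFFS: List[Tuple[float, str]] = [
    (90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D"),
]


def course_grade(earned: Dict[str, float],
                 possible: Dict[str, float],
                 cutoffs: List[Tuple[float, str]] = ASSUMED_CUTOFFS
                 ) -> Tuple[float, str]:
    """Pool component scores into a percentage and a letter grade."""
    total_earned = sum(earned.values())
    total_possible = sum(possible[name] for name in earned)
    percentage = 100.0 * total_earned / total_possible
    # Cutoffs are listed from highest to lowest, so the first match wins.
    for minimum, letter in cutoffs:
        if percentage >= minimum:
            return percentage, letter
    return percentage, "F"


# Invented component scores for one student (not data from the study).
earned = {"quiz": 18, "case_study": 42, "midterm": 85, "final": 90}
possible = {"quiz": 20, "case_study": 50, "midterm": 100, "final": 100}
pct, letter = course_grade(earned, possible)
print(f"{pct:.1f}% -> {letter}")
```

Because every component contributes raw points, tasks with larger maximum scores weigh more heavily in the overall score; a weighted scheme would be a different pooling method from the one the paper describes.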

Figure 1 about here 

Decision-making: The decision to be made has to be a logical step linked to the 
information derived from the assessments. Logical organization, appropriate 
analysis and accurate interpretation and reporting of each student's assessment 
data therefore cannot be overemphasized. There are different methods of pooling 
together assessment information for decision making (see for example Kubiszyn 
and Borich, 1994; 2000). Teachers need to be aware of these different methods, 
select appropriate ones, and assess how well the methods work for their classes. 



Research Method: 

The researcher analyzed the classroom assessment task descriptions and 
requirements, as well as the resulting sets of scores (from her files), to study her 
own pattern of grading and to explore relationships among the sets of scores from 
different types of assessments. The assessment tasks, the test items, the test 
scores and the resulting course grades were considered secondary data since their 
primary purpose of assessing students' learning had been completed. 

Studying Content Validity 

Each task was analyzed in terms of its description, requirements, scoring criteria, 
and then compared to the respective syllabus and teaching feedback from students, 
to examine content validity. Test items were reviewed each time before the test was 
given to the respective students. Item analysis was carried out after each test in 
order to identify weak test items, which were then eliminated from the test. Aligning 
the test items and other assessment tasks with: the specific topics on the syllabus, 
student feedback on their learning of those topics, the scoring criteria, the obtained 
scores and the standard error of measurement for each set of scores helped the 
researcher estimate the content validity of the assessments. 
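
The paper does not say which item statistics were used to identify weak test items. As a minimal sketch, assuming the common upper/lower-group approach, the code below computes a difficulty index (proportion correct) and a discrimination index (upper-group minus lower-group proportion correct) for each item. The 27% group fraction, the 0.2 flagging threshold, and the response matrix are assumptions for illustration, not values from the study.

```python
from typing import List, Tuple


def item_analysis(responses: List[List[int]],
                  fraction: float = 0.27) -> List[Tuple[float, float]]:
    """Return (difficulty, discrimination) for each item.

    responses[s][i] is 1 if student s answered item i correctly, else 0.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    # Rank students by total test score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(fraction * n_students))
    upper, lower = ranked[:k], ranked[-k:]
    results = []
    for i in range(n_items):
        difficulty = sum(row[i] for row in responses) / n_students
        discrimination = (sum(row[i] for row in upper)
                          - sum(row[i] for row in lower)) / k
        results.append((difficulty, discrimination))
    return results


# Invented responses: 6 students (rows) by 3 items (columns).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
results = item_analysis(responses)

# Assumed rule of thumb: flag items whose discrimination falls below 0.2.
weak_items = [i for i, (_, d) in enumerate(results) if d < 0.2]
print(weak_items)
```

In this invented data set the third item is answered correctly as often by low scorers as by high scorers, so it is the one flagged for review or removal.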

Studying Construct Validity 

The tasks were also analyzed in relation to relevant theories to examine their 
construct validity. Since the courses taught in the four years of the study fell in five 
main categories, i.e., assessment of learning, education of young children, 
language acquisition, effective teaching in culturally diverse classrooms, and 
research methods, fundamental theories were reviewed in these five areas. The 
researcher aligned the objectives of the course to the goals stated in the prescribed 
teaching standards, and examined the assessment tasks and test items to 
crosscheck their links to those objectives. By aligning the assessment tasks to 
objectives, goals, standards, the philosophy of the course and the implied theories, 
it was possible to determine construct validity of the assessment tasks. 

Studying Concurrent Validity 

Correlation coefficients among the various quizzes, tests and examinations were 
computed using the Pearson Product Moment Correlation Coefficient method in the 
Corel Quattro Pro spreadsheet. Figure 2 shows one group's set of scores from different class 
assignments, the maximum possible scores, obtained scores, and a correlation 
matrix. The correlation coefficients were used to estimate the concurrent validity of 
the different tasks. By squaring the correlation coefficients the researcher could 
more closely estimate the amount of variability in one task that could be explained 
by the variability in another task. 
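
The study carried out these computations in a spreadsheet. For readers who prefer code, the following is a rough Python equivalent of the same two steps: a Pearson product-moment correlation matrix across tasks, and the squared coefficients as the proportion of shared variance. The function names and the scores are invented for illustration and are not the data behind Figure 2.

```python
from math import sqrt
from typing import Dict, List


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(scores: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Correlate every task with every other task for one group of students."""
    names = list(scores)
    return {a: {b: pearson_r(scores[a], scores[b]) for b in names} for a in names}


# Invented scores for five students on three tasks (same student order in each list).
scores = {
    "quiz": [8, 6, 9, 5, 7],
    "midterm": [70, 80, 90, 60, 85],
    "final": [72, 78, 95, 65, 80],
}
matrix = correlation_matrix(scores)

# Squaring r gives the proportion of variability shared by two tasks.
shared_variance = {a: {b: r ** 2 for b, r in row.items()} for a, row in matrix.items()}

for a, row in matrix.items():
    print(a, {b: round(r, 2) for b, r in row.items()})
```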



Figure 2 about here 

Studying Predictive Validity 

The focus of the study was not on how well single assignments predicted the overall 
grade in the course. However, the researcher thought it would be interesting to find 
out if some assessment tasks were better predictors than others, and what the 
underlying reasons might be. 

Studying Reliability of the Assessment Tasks 

Sets of scores from similar tasks and tests were correlated using the Quattro Pro 
Spreadsheet, to estimate the reliability of those tasks across groups of students. 
Students' scores were matched on the basis of their average performance halfway 
through the course, including the midterm exam. This approach enabled the 
researcher to treat each assessment task as an independent measure across 
groups. The extent to which one assessment task was related to the overall group 
performance in the course was compared to the extent to which that task was 
related to overall performance of similar groups. The higher the correlation, the 
more consistent the task would be. 

The researcher did not consider the reliability of the task or test in general terms 
(which would then, supposedly, apply to any group of students). Rather, the 
researcher focused on the characteristics of the particular group, and assumed the 
reliability of the tasks would hold only for groups that could be closely matched. 
Since there is not a single value that will provide the perfect reliability of an 
assessment instrument on its own (Thorndike, 1997), the group combinations of 
students, the students' backgrounds as well as teaching and assessment 
procedures had to be considered. Lacking a better term to describe this type of 
consistency for the purpose of the study, the researcher used contextual reliability 
to refer to the degree of consistency of an assessment task across similar groups of 
students. 
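
The paper gives no formula for contextual reliability, so the following is only one possible reading of the procedure, offered as a sketch. It assumes that students in two similar groups are ranked by their midcourse average, paired rank by rank, and that their scores on the same task are then correlated. The pairing rule, the handling of unequal group sizes, the function names, and all the numbers are assumptions.

```python
from math import sqrt
from typing import List, Tuple

# (midcourse average, score on the task of interest) for one student.
Student = Tuple[float, float]


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def contextual_reliability(group_a: List[Student], group_b: List[Student]) -> float:
    """Correlate task scores of students matched by midcourse rank."""
    # Rank each group by midcourse average, highest first.
    a = sorted(group_a, key=lambda s: s[0], reverse=True)
    b = sorted(group_b, key=lambda s: s[0], reverse=True)
    # Assumption: if the groups differ in size, only the top-ranked students
    # that can be paired are kept.
    n = min(len(a), len(b))
    return pearson_r([s[1] for s in a[:n]], [s[1] for s in b[:n]])


# Invented data for two similar groups taking the same timed exam.
group_a = [(92, 88), (85, 80), (78, 75), (70, 62)]
group_b = [(90, 85), (84, 82), (75, 70), (68, 65)]
r = contextual_reliability(group_a, group_b)
print(round(r, 2))
```

A high coefficient under this reading would mean that students at matching midcourse ranks tended to receive matching task scores in both groups, which is the sense of consistency across similar groups that the term is meant to capture.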

Findings 

Evidence of Content Validity 

Thorough analysis and aligning of the assessment tasks with objectives, goals, and 
prescribed teaching standards and scoring criteria, showed high content validity. 
The Standard Error of Measurement values were very low, implying a close match 
between the obtained scores and the unknown true scores. The researcher did not 
see the need to adjust the scores. 
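
The paper does not state how the Standard Error of Measurement was computed. A minimal sketch, assuming the standard classical test theory estimate, SEM = SD x sqrt(1 - reliability), and a population standard deviation; the exam scores are invented, and 0.86 is simply the upper end of the reliability range reported for timed exams.

```python
from math import sqrt
from statistics import pstdev
from typing import List


def standard_error_of_measurement(scores: List[float], reliability: float) -> float:
    """Classical test theory estimate: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return pstdev(scores) * sqrt(1.0 - reliability)


# Invented exam scores; the reliability value is taken from the reported range.
exam_scores = [60, 70, 80, 90, 100]
sem = standard_error_of_measurement(exam_scores, reliability=0.86)
print(round(sem, 2))
```

The smaller the SEM relative to the score scale, the narrower the band within which a student's true score is likely to fall around the obtained score, which is the reasoning behind the decision not to adjust scores.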

Table 1 provides a summary of the content validity checks for the different 
assessment tasks used with the 26 groups. 

Table 1 about here 

As Table 1 shows, the assessment tasks differed in their degree of content 
validity. For the groups taught at the beginning of the research period, the quizzes 
had little content validity for the course. Since these beginning-of-course quizzes 
were intended to give the teacher/researcher a sense of the students' overall 
preparedness to take the course (sizing-up assessment; Airasian, 1996), their content 
was of a general nature. Case studies, reaction papers, research summaries, research 
papers, class presentations, course portfolios, research proposals (in the few times 
they were utilized), and class poster sessions indicated moderate-to-high content 
validity across the 26 groups. Midterm examinations and final examinations were 
very closely matched to the content of the course, indicating high content validity. 
On the whole, the validity checks of all 26 sets of data indicated that the 
assessment tasks were matched (M), or closely matched (CM), or even perfectly 
matched (PM) to the content of the course (see Table 1). 

Secondly, an interesting trend emerged from the content validity checks of the 
assessment tasks. The more recently the group was taught, the closer the match 
between assessment tasks and the objectives/goals of the course. This trend 
indicates learning on the part of the researcher, i.e., increased competence in 
developing content valid assessment tasks. 

Evidence of Construct Validity 

Examination of the assessment tasks in relation to the guiding theories in each of 
the five academic areas revealed a strong link, indicating construct validity of the 
tasks. This type of evidence of validity was easier to determine since the course 
syllabuses were developed on the basis of the fundamental theories in the 
respective academic areas. 

Evidence of Concurrent Validity 

Matching scores from the different assessment tasks indicated varied degrees of 
correlations among them. The assessment tasks that had mostly moderate to high 
concurrent validity were: timed examinations (midterm and final exams), take home 
group tasks (case studies and class poster preparation), and class group tasks 
(class poster sessions, presentations). Concurrent validity coefficients were 
consistently low when scores were compared between the following assessment 
tasks: timed versus non-timed, take-home versus class tasks, group versus 
individual tasks, and essays versus objective examinations. Table 2 presents this 
information. 

Table 2 about here 



Evidence of Predictive Validity 

Predictive validity coefficients between the first quizzes and the final score were 
consistently low (from .2 to .4). This means the students "grew" considerably in 
unpredictable ways during the course. The midterm exams and final exams were 
found to be strong predictors of the total score in the course (correlation coefficients 
between 0.57 and 0.91). One explanation of this high predictability could probably 
be that students prepared for the exams in similar ways. Examination anxiety could 
be another factor that kept students at their relative rankings across timed 
examinations. 
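
To put these coefficients in terms of shared variance, each one can be squared, as described under Studying Concurrent Validity. The figures below are simple arithmetic on the endpoints of the reported ranges, not additional results from the study:

```latex
\begin{align*}
r = 0.20 &\;\Rightarrow\; r^2 = 0.04 \\
r = 0.40 &\;\Rightarrow\; r^2 = 0.16 \\
r = 0.57 &\;\Rightarrow\; r^2 \approx 0.32 \\
r = 0.91 &\;\Rightarrow\; r^2 \approx 0.83
\end{align*}
```

On this reading, a first quiz accounts for at most about 16% of the variability in the final score, while a timed examination accounts for roughly 32% to 83% of it.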

Evidence of Reliability of the Assessment Tasks 

Consistency of the different assignments over time was of great interest to the 
researcher. Halfway through the course students had relatively stabilized in 
achievement. Midcourse ranking of students helped the researcher in matching 
groups' scores in terms of timed and non-timed, individual and group assignments. 
For each timed examination, the researcher developed alternative exams and 
through item analysis the examinations were consistently "cleaned" to get rid of any 
flaws that might interfere with the quality of the testing process. 

Most of the reliability coefficients of the timed exams were high 
(between 0.73 and 0.86), although there were a few moderate coefficients and one low 
coefficient of 0.2. This means that students generally kept their relative positions in 
their groups. However, some of the groups of students were more similar than others. 

The non-timed and group assessment tasks were less consistent in determining 
student performance across groups. Take-home group tasks had low or no 
correlation. This could be explained by the possibility that students learned to work 
together in similar ways and scored about the same, leaving little variation in scores 
for a correlation to detect. The relatively higher 
correlation coefficients of the take-home individual assignments strengthen the 
possibility that working together in class and outside class became a factor in 
obtaining similar scores across these assignments, hence low correlations in group 
assignments. As such, listing the assessment tasks in order of consistency across 
groups, timed exams were the most consistent assessment tasks, followed by 
individual class assessment tasks, individual take-home assessment tasks, group 
class assessment tasks, and finally group take-home assessment tasks. Table 3 
summarizes this information. 

Table 3 about here 

Discussion 

Assessment of learning is useful only when it yields, to a good extent, a 
representative picture of what the individual student has learned. To achieve a 
representative picture of student learning, assessment tasks must be closely related 
to the content based on sound theories, and must be consistent in determining if 
effective learning of the targeted skills has occurred. Valid and reliable assessments 
are more likely to provide a representative picture of each student’s learning, than 
are invalid and unreliable assessments. 

In this study, timed assessment tasks (tests and exams) seemed to be more 
consistent in measuring students' learning. They had slightly higher content validity 
and they were also better predictors of the final score in the course. However, since 
teachers and evaluators cannot depend on timed exams and quizzes only, there is 
need to maximize the effectiveness of non-timed, take-home assignments, and use 
them alongside the timed ones, so that the final grade awarded is closely 
representative of the individual students' acquired learning in the course. Some 
ways of making the non-timed and take-home assignments more effective feedback 
mechanisms, as employed by the current researcher, include: ensuring content 
validity of the assignments, providing formats for doing the assignments, providing 
and discussing scoring criteria beforehand, and having the students talk about their 
own work before scoring is done. Giving students a chance to talk about their 
projects or take-home papers may highlight important points that might bypass the 
instructor's/scorer's attention, or that might be misinterpreted due to cultural and 
language differences. Although take-home and group assignments may not yield 
high correlations, they will be reliable in the sense that each time they are used they 
will achieve maximum assessment information about that particular student, and not 
necessarily the student's ranking within the group. 

It is possible that in some cases a student may get help from family members or 
friends and do well on a take-home assignment one time, and not get the help 
another time, and do poorly. This could be the cause for the inconsistency of 
scores as observed in this study, with regard to take-home and group assignments. 
Fluctuation of scores on take-home assignments could also be due to difficulty 
levels of those tasks and how much the specific tasks appealed to individual 
students. Similar group performance (therefore low correlation) indicated 
possibilities that students helped each other, that low-achieving students agreed with 
high achievers, or that high achievers did not perform at their best. This possibility of 
helping each other, or under-performing, underscores the necessity to give timed, 
non-timed, individual and group assignments to students. 

While cooperative learning and group projects are a wonderful way of learning, the 
need to estimate individual capabilities as closely and as validly as possible cannot 
be overemphasized. With regard to student teachers in particular, once one 
graduates from a teacher preparation program, the main assumption is that the 
graduate is ready to take classroom teaching responsibilities and be accountable for 
students' learning. Cooperation will be a highly desired addition to one’s individual 
competence. 



Summary of Emerging Trends and Suggestions for Classroom Teachers 

When teaching and assessment are closely interwoven, the validity of the 
assessment tasks is enhanced. Both the teaching and the assessment tasks match 
the same targeted skills and knowledge. When assessment is treated as a final 
stage of a teaching period, the match between what is taught and what is assessed 
may be jeopardized. The following are important trends directly and/or indirectly 
arising from this self-study. The researcher learned important lessons from both the 
research content and from carrying out the research process itself. These might be 
useful to other classroom teachers interested in learning more about their 
assessment practices: 

1. Both timed and non-timed assessment tasks were closely related to the 
target content, hence content validity. Construct validity was confirmed 
through examination of relevant theories. Concurrent and predictive 
validity of the different assessment tasks differed considerably, indicating 
that different assessment tasks measure group performances differently. 
However, put together, the assessment tasks measured students' learning 
more accurately than would timed or non-timed assessment tasks alone. 

2. Timed assignments such as quizzes, tests, and examinations, were more 
consistent in measuring the overall learning in the course, than were the 
non-timed assignments. This does not mean that timed assessments are 
inherently better than non-timed assessments. Tests and examinations 
must be carefully developed and utilized according to testing principles. 
Awareness and use of professional guidelines in developing quizzes, tests 
and examinations are the only means to ensure high quality test items that 
target the intended knowledge and skills without trivializing them. 
Competence in item writing and item analysis is a pre-requisite for 
developing and using these types of measures. Haphazard item writing 
will result in flawed tests with low validity and reliability. Quizzes, tests, 
and exams are powerful tools of assessment when utilized with trained 
expertise. 

3. A combination of both timed and non-timed, individual and group, take- 
home and in-class assessment tasks helped the researcher to separate 
specific areas of the targeted content that could best be assessed using a 
certain mode of assessment. Pooling together the different assessments 
consistently indicated that students were given opportunities to 
demonstrate learning using different means (tasks). Each student's overall 
grade for the course was therefore highly representative of the student's 
achieved learning. Obtaining assessment information from as many 
perspectives as possible helped accommodate students' different learning 
styles and modes of expression and talents that might not have been 
captured by a single assessment tool. The high correlations among timed 
assessment tasks imply that students kept their class ranks in certain 
areas (possibly little room for creativity), and fluctuated in other areas 
(more room for creativity). 

4. The researcher realized that providing students with as much information 
as possible at the beginning of the course helped improve validity of 
assessments. During the course students became increasingly aware of 
what was expected of them in doing the assessment tasks. Providing 
scoring criteria together with the assessments in the beginning of the 
semester made it easy for the students to understand the tasks and plan 
their responses. 

5. Finally, the researcher strongly suggests consistency, systematicity and 
self-evaluation in assessment practices. She was able to study her own 
assessment practices because she kept track of the assessment practices 
she used with different groups in her four year teaching period. Through 
modifying assessment tasks and doing item analysis of timed 
examinations she was able to improve the validity of her assessments. By 
studying the consistency of each assessment tool and similar ones in 
relation to each individual group of students (contextual reliability) she 
could predict students' performance on a component of the course and 
take necessary measures before assessment problems occurred. 

The researcher suggests that classroom teachers study their own 
assessment practices to better conceptualize comprehensive assessment 
and continuously improve validity and reliability. This will lead to more 
accurately informed decisions about teaching, learning, students' 
readiness for the job market, and program revisions. 

Abstract 



The aim of the evaluation was to combine theory and practice to develop better understanding of 
classroom assessment in general and assessment practices in particular. Analysis of secondary 
data was used as a way to inform the researcher about the trends in her assessment practices 
over a four-year period. This was an important initial step in the effort to develop and integrate 
high-quality classroom assessment tasks and make sense of assessment information for 
decision-making. 

The author analyzed scores from 26 groups of graduate and undergraduate Education students 
in three universities in the United States. Course goals, objectives and syllabuses were analyzed. 
Students' backgrounds and group combinations (age, gender, socio-economic status) were taken into 
consideration in determining the consistency of specific assessment tasks in providing feedback 
to the instructor as researcher. 

The study results provided evidence of high content validity as well as high construct validity of 
timed and non-timed assessment tasks. Concurrent validity among similar assessment tasks was 
evident. However, the predictive validity of assessment tasks (individual task to the final score) 
varied depending on whether the assessment task was non-timed (r = .20 to r = .41) or timed. 
Timed assessment tasks (including tests and examinations) were high predictors of the student's 
performance in the course (r = .57 to r = .91). 

Timed assessment tasks were more reliable (consistent) than non-timed tasks in providing 
assessment feedback across similar groups and contexts ("contextual reliability"). Scores from 
non-timed assessment tasks fluctuated more from group to group (r = .04 to r = .68) than scores 
from timed tasks (r = .63 to r = .74). Non-timed tasks probably tapped student skills and strategies 
that were not retrievable through timed examinations. 

The study highlights the importance of understanding, documenting and evaluating assessment 
practices to better inform decision making at both classroom and program levels. 



The Course Grade: 

• determined by an overall score. 

The Overall Score: 

• determined by component scores 

Component Scores: 

• obtained from different types of assignments: timed, and non-timed. 



Figure 1: Building the Course Grade 

Figure 2: Sample Correlation Matrix 

TABLE 2: CONCURRENT VALIDITY COEFFICIENTS OF ASSESSMENT TASKS 

TABLE 3: TYPES OF ASSESSMENT TASKS AND THEIR RELIABILITY ESTIMATES 